The practical exploitation of the vast numbers of sequences in the gen-
ome sequence databases is crucially dependent on the ability to identify
the function of each sequence. Unfortunately, current methods, including 
global sequence alignment and local sequence motif identification, are 
limited by the extent of sequence similarity between sequences of 
unknown and known function; these methods increasingly fail as the 
sequence identity diverges into and beyond the twilight zone of sequence 
identity. To address this problem, a novel method for identification of 
protein function based directly on the sequence-to-structure-to-function 
paradigm is described. Descriptors of protein active sites, termed "fuzzy 
functional forms" or FFFs, are created based on the geometry and confor- 
mation of the active site. By way of illustration, the active sites respon- 
sible for the disulfide oxidoreductase activity of the glutaredoxin/ 
thioredoxin family and the RNA hydrolytic activity of the T1 ribonuclease
family are presented. First, the FFFs are shown to correctly identify their
corresponding active sites in a library of exact protein models produced 
by crystallography or NMR spectroscopy, most of which lack the speci- 
fied activity. Next, these FFFs are used to screen for active sites in low- 
to-moderate resolution models produced by ab initio folding or threading 
prediction algorithms. Again, the FFFs can specifically identify the func- 
tional sites of these proteins from their predicted structures. The results 
demonstrate that low-to-moderate resolution models as produced by 
state-of-the-art tertiary structure prediction algorithms are sufficient to 
identify protein active sites. Prediction of a novel function for the gamma 
subunit of a yeast glycosyl transferase and prediction of the function of
two hypothetical yeast proteins whose models were produced via thread- 
ing are presented. This work suggests a means for the large-scale func- 
tional screening of genomic sequence databases based on the prediction 
of structure from sequence, then on the identification of functional active 
sites in the predicted structure.

Introduction

The Human Genome Project began with the 
specific goal of obtaining the complete sequence of 
the human genome and determining the biochemical 
nature of each gene. To date, the project has been 
quite successful, with sequencing of the human
genome about 1.2% complete (J. Roach,
http://weber.u.washington.edu/~roach/human_genome_progress2.html;
Gibbs, 1995), and is on track for its
scheduled completion in the year 2005. Further- 
more, the genomes of 14 organisms have been 
sequenced and published, including Mycoplasma 
genitalium (Fraser et al., 1995), Methanococcus jan-
naschii (Bult et al., 1996), Haemophilus influenzae
(Fleischmann et al., 1995), Escherichia coli (Blattner
et al., 1997) and Saccharomyces cerevisiae (Mewes
et al., 1997). Significant progress has been made in
mapping and sequencing the genomes of model 
eukaryotic organisms, such as mouse, Caenorhabdi- 
tis elegans and Drosophila melanogaster. 

One of the goals of the genome project is to 
develop tools for comparing and interpreting the 
resulting genomic information (Collins & Galas, 
1993). Researchers must learn where each gene lies 
and must understand the function of each gene or 
gene product: is the nucleotide sequence a regulat- 
ory region? Does the nucleotide segment produce a 
gene product? Is the product active as an RNA or 
a protein molecule? What function does the gene 
product perform: does it bind to another molecule, 
is it important for regulation of cellular processes, 
does it catalyze a chemical reaction? The import- 
ance of answering these questions has led to 
research efforts directed towards understanding or 
describing the function of each sequence, particu- 
larly for protein sequences and open reading 
frames (ORFs). Most often functional analysis is 
done by sequence comparison to proteins of 
known structure or function; however, because of 
the lack of sequence similarity, these methods fail 
on about half of the sequences available in the 
sequence and genome databases (Delseny et al.,
1997; Dujon, 1996). Other approaches to function
prediction include comparison of the complete
microbial genomes sequenced thus far
(Himmelreich et al., 1997) and an analysis of gene
clustering (Himmelreich et al., 1997; Tamames
et al., 1997). Some have proposed experimental methods
to accomplish aspects of function prediction on a
genome-wide basis (Fromont-Racine et al., 1997;
Sakaki, 1996). Here, in contrast, we present a novel 
method for protein function prediction based on 
the sequence-to-structure-to-function paradigm, 
where the protein structure is first predicted from 
the sequence, then the active site is identified 
within the predicted structure. Thus, this method 
requires only knowledge of the protein primary 
sequence. As will be demonstrated, enzyme active 
sites can be specifically identified in structures pro- 
duced by state-of-the-art prediction algorithms 
where the atomic coordinates are not well defined. 

Sequence alignment methods for 
function identification 

The most common method of function identifi-
cation from just the sequence is global or local
sequence alignment. This technique is based on 
finding the extent of sequence identity between a
given sequence and another whose function is
known. Significant sequence identity is a strong
indicator that the proteins probably have similar
functions. Alignment methods such as BLAST
(Altschul et al., 1990), BLITZ (MPsrch; Sturrock &
Collins, 1993), and FASTA (Pearson & Lipman,
1988), among others, are currently the most powerful
techniques for analyzing the many sequences
found in the genome databases. Today's methods
are robust, fast and powerful for determining the
relatedness of protein sequences, particularly when
the sequence identity is above 30% and the
relationship between proteins is unequivocal. 
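
The identity thresholds quoted above can be made concrete with a short sketch. The Python below is illustrative only and is not part of BLAST, BLITZ or FASTA: the function names are hypothetical, identity is assumed to be counted over columns in which neither sequence has a gap (other denominators are in use), and the 25% and 30% cut-offs are simply the twilight-zone bounds used in this paper.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the ungapped columns of two aligned sequences.

    Both inputs are rows of an existing alignment (equal length, with
    '-' or '.' marking gaps). Computing the alignment is out of scope.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-." or b in "-.":
            continue  # assumption: gapped columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_regime(pct: float) -> str:
    """Label a percent identity relative to the 25 to 30% twilight zone."""
    if pct > 30.0:
        return "above twilight zone"
    if pct >= 25.0:
        return "twilight zone"
    return "below twilight zone"


if __name__ == "__main__":
    # Hypothetical aligned fragments, for illustration only.
    pct = percent_identity("ACDE-FG", "ACDQKFG")
    print(f"{pct:.1f}% identity: {identity_regime(pct)}")
```

A pair scoring in or below the twilight zone would be the case in which, as discussed next, alignment alone cannot be trusted to assign function.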



Limits to sequence alignment methods 

A major problem with sequence alignment
methods for analysis of protein function arises
when the sequence similarity goes below the twilight
zone of 25 to 30% sequence identity. Currently
available programs cannot consistently
detect functional and structural similarities when
the sequence identity is less than 25% (Hobohm &
Sander, 1995). Matches with 50% amino acid identity
over a 40 residue or shorter stretch of sequence
regularly occur by chance and relationships
between such proteins must be viewed with caution,
unless other information is available (Pearson,
1996). In the worst case, protein sequences or ORFs
do not return significant matches to any sequence
in the database. For instance, experiments showed
that an ORF from an intron in a cyanobacterium
tRNA (Biniszkiewicz et al., 1994) was found to produce
a protein with endonuclease activity, but no
significant match to known proteins was returned
from sequence database searches (D. A. Bonoco
& R. P. Shub, personal communication). With the
exponential growth in the number of available
sequences from the genome sequencing projects,
increasing numbers of sequences cannot be aligned
with certainty to known proteins on the basis of
their sequence alone, and this limits the ability to
assign a function to these sequences.

Functional identification using local 
sequence motifs 

To overcome some of the problems associated
with employing sequence alignments to determine
protein function, several groups have developed
databases of short sequence patterns or motifs
designed to identify a given function or activity of
a protein. These databases, notably Prosite
(http://expasy.hcuge.ch/sprot/prosite.html; Bairoch et al.,
1995), Blocks (http://www.blocks.fhcrc.org;
Henikoff & Henikoff, 1991) and Prints
(http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html;
Attwood & Beck, 1994; Attwood et al., 1994, 1997),
use short stretches of sequence information to identify
sequence patterns that are specific for a given function;
thus, they avoid the problems arising from
the necessity of matching entire sequences. Protein
function can be identified by either a single, local
sequence motif or a set of local motifs. Typically, a 
local sequence pattern is developed by first identi-
fying the functionally important residues from a
literature search. A set of proteins that are known 
to belong to the family are aligned and, on this 
basis, the minimal local sequence signature is 
developed. This signature is then tested against the 
sequence database and, if false positives are found, 
the sequence alignment is used to identify con- 
served residues that are then added to the signa- 
ture. This process is iterated until a local signature 
of some specificity is derived. The Prints and Blocks 
databases use multiple alignment representations 
to improve the specificity. Developers of the Blocks 
database have automated the procedure for produ- 
cing patterns (Henikoff & Henikoff, 1991). Any 
newly determined sequence can be rapidly com- 
pared to these dictionaries of patterns in Prosite, 
Prints and Blocks, and if any matches are found, 
the new sequence can be assigned to the corre- 
sponding functional family. In practice, these 
approaches are quite successful. As a result of their 
utility and power, the Prosite, Prints and Blocks 
databases are regularly used by the scientific com- 
munity. 
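
The matching step for a single local signature can be sketched as follows. This is a minimal illustration, assuming a simplified subset of the Prosite pattern notation (single residues, x, [..] sets, {..} exclusions and (n) or (n,m) repeats); the function names and the toy sequence are hypothetical, and the C-x(2)-C pattern is used only because it is the cysteine spacing discussed later in the text, not because it is an actual Prosite entry.

```python
import re

# One pattern element: x, a residue, [allowed set] or {forbidden set},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(
    r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified Prosite-style pattern into a Python regex."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            piece = "."                          # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"      # any residue except these
        else:
            piece = core                         # residue or [allowed set]
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


def find_motif(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every occurrence,
    including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MKWCGPCKAA"  # hypothetical fragment, not a database entry
    print(find_motif(toy_sequence, "C-x(2)-C"))
```

A pattern this short matches many unrelated proteins, which is why signatures are iteratively lengthened with further conserved positions until they become specific.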

Conservation of three-dimensional structure in 
protein active sites 

While sequence signatures are very powerful for
protein function prediction, they still fail
to identify protein function for a variety of reasons,
all of which, in principle, stem from the fact that
the chemistry required for the functionality of pro-
tein active sites arises from their three-dimensional
structure. Thus, as sequences diverge, only those 
residues required for the chemistry of the protein 
activity will be absolutely conserved. The structure 
of these active-site residues in three-dimensional 
space should also be conserved. In general, local 
sequence motifs will be unable to recognize such 
conserved three-dimensional structure, especially if 
it involves residues that are non-local in sequence. 
Although the Prints (Attwood et al., 1994) and
Blocks (Henikoff & Henikoff, 1991) databases have 
attempted to circumvent this problem by develop- 
ing multiple local sequence signatures for a given 
functional family, the three-dimensional structure 
of the active site is still not represented by these 
one-dimensional sequences. But, it is the three- 
dimensional structure of active site residues that is 
explicitly conserved, as illustrated by the following 
examples. 

The three-dimensional structure of urease was 
recently compared to those of adenosine deami- 
nase and phosphotriesterase (Holm & Sander, 
1997b). Previous one-dimensional sequence com- 
parison had failed to detect any relationships 
between these proteins; however, comparison of 
their three-dimensional structures showed conser- 
vation of local structure around the active site, 
although the global folds are different. This same
active-site geometry was then observed in an even
larger family of enzymes, with an even greater 
diversity of overall tertiary structure, that are 
involved in nucleotide metabolism (Holm & 
Sander, 1997b). The geometry of the active site 
would not be recognized by local sequence signa- 
tures or by overall comparison of global tertiary 
structures, but only from an analysis of the struc- 
ture of the functional residues around the active 
site. In another example, an analysis of the ribonu- 
cleotide reductases from archaebacteria, eubacteria 
and eukaryotes shows that critical cysteine resi- 
dues in the catalytic domain of this enzyme are 
conserved across all organismal boundaries (Tauer 
& Benner, 1997). However, once again based on 
sequence alignment alone, the ribonucleotide
reductases are not obviously related. 

The more divergent the sequences are, the more 
difficult it is to show a familial functional relation-
ship just by sequence comparison, even if the cata-
lytically important residues are invariant. At the
limit, proteins with completely different structures 
can have similar functions. The bacterial and 
eukaryotic serine proteases, having very different 
protein structures and very similar active sites 
(Branden & Tooze, 1991), illustrate this point. Local 
sequence signatures would be unable to recognize 
these proteins as belonging to the same functional 
family because there would be no sequence simi- 
larity other than the identity and relative orien- 
tation of the specific active-site residues, which are 
non-local in sequence. 

Thus, based on the above data, one must ident- 
ify the global fold of a protein and the specific geo- 
metric arrangement of the active-site residues. In 
other words, one needs to determine both the glo-
bal fold and the local structure of those residues
that are functionally important. Local sequence sig- 
natures, although very powerful, may not be able 
to recognize the active-site residues, because 
sequence information is inherently one-dimen- 
sional, while protein active sites are inherently 
three-dimensional. But, a method based on identi- 
fying the conserved structure found in protein 
active sites could easily recognize the active-site 
residues and could classify such proteins as 
belonging to a given functional family. 

The sequence-to-structure-to-function 
paradigm and its application to 
function prediction 

In what follows, we describe such a method for 
identification of protein function based on the 
sequence-to-structure-to-function paradigm. We
make the reasonable assumption that three-dimen- 
sional information is important to the chemistry of 
protein function; therefore, the active-site structure 
of the residues responsible for that function will be 
conserved and we can identify it. In this spirit, we 
develop three-dimensional descriptors of specific 
protein functions, termed fuzzy functional forms 
or FFFs, based on the geometry, residue identity,
and conformation of protein active sites. These
FFFs are based on known crystal structures of 
members of the functional family and on exper- 
imental data available from the literature. The idea 
is similar in concept to that of Hellinga and 
Richards, who developed three-dimensional 
descriptors of metal-binding sites in order to intro- 
duce novel binding sites into proteins (Hellinga 
et al., 1991; Hellinga & Richards, 1991). Instead of
making the descriptors overly specific, however, 
we explore how much they can be relaxed (i.e. 
made "fuzzy") while still specifically identifying 
the correct active sites in a database of known 
structures. We then show that these fuzzy func- 
tional descriptors developed on the basis of protein 
models can identify protein active sites not only 
from experimentally determined structures, but 
also from predicted protein structures provided 
either by ab initio folding algorithms or by thread- 
ing algorithms. Thus, low-to-moderate resolution 
structures produced by current structure predic- 
tion algorithms are sufficient to identify active sites 
in these models. These results should allow us to 
significantly extend the analysis of functional
families further into and beyond the twilight zone 
of sequence similarity, and should allow a more 
extensive functional analysis of the rapidly 
expanding genomic databases. 

Here, the disulfide oxidoreductase activity of the 
glutaredoxin/thioredoxin family and the RNA 
hydrolytic activity of the T1 ribonuclease family are
presented as illustrations and proof-of-principle of 
the method. First, however, to illustrate the need 
for a new approach, we discuss the problems aris- 
ing when local sequence signatures are used to 
identify the disulfide oxidoreductase activity of 
the glutaredoxin/thioredoxin family. Next, we 
describe the development of the FFF for this 
activity and demonstrate its specificity in identify- 
ing active sites in exact protein models. We then 
show that the FFF can specifically identify active 
sites in low-to-moderate resolution models pro- 
duced by either ab initio folding or threading 
algorithms. Based on the application of the glutare- 
doxin/thioredoxin FFF to a threading model, a 
prediction of a novel active site in the gamma sub- 
unit of yeast glycosyl transferase and prediction of 
the active sites for two hypothetical yeast proteins 
whose functions have not been previously ident- 
ified by either the Prosite, Prints or Blocks data- 
bases are described. Finally, to demonstrate that 
the result is not exclusive to the glutaredoxin/ 
thioredoxin family, we present some results for the 
RNA hydrolytic active site of the T1 ribonuclease
family. 

Results 

Analysis of the performance of local sequence
motifs for identifying function 

As mentioned in the Introduction, local sequence
signatures designed for function identification
become increasingly less specific as the number of
sequences within a protein family increases. To
illustrate this point more fully, we performed an
analysis of the Prosite database (Release 13.0,
November, 1995). All instances of true positive,
false positive and false negative sequences, as
identified by the Prosite developers, for each
family were collected and the results are plotted in
Figure 1. These data clearly demonstrate that local
sequence signatures perform quite well on many
families, especially when the number of sequences
found in a family is low. Of the 1152 patterns in
this release of Prosite, 908 (79%) of the patterns
were specific for their sequences (using the set of
true and false positives and negatives as identified
by the Prosite developers). However, as the number
of observed instances of a local pattern
increases, the number of false positives also tends
to increase. For 10.5% of the patterns, 90 to 99% of
the selected sequences were true positives, while
for the remaining 10.5% of the patterns, less than
90% of the selected sequences were true positives
(Figure 1).

Figure 1. Data for 1152 patterns found in the Prosite
database (Release 13.0). Fraction is the fraction of true
positives (open circles), false positives (filled diamonds)
or false negatives (X) found out of the total number of
pattern occurrences. True positives, false positives and
false negatives are those identified by the Prosite developers.
A, All the data for 1152 patterns; B, the same data
with an expanded view of the x-axis.

However, the results illustrated in Figure 1 are 
not the entire story. The data in Figure 1 were
compiled only from those true and false positives 
and negatives identified by the Prosite developers. 
But identification of true and false positives and 
negatives can be ambiguous and can differ among 
the Prosite, Blocks and Prints databases. The cur- 
rent Prosite database (updated September 10, 1997) 
lists 111 true positives, five false positives and 
one false negative for its thioredoxin sequence 
signature (PS00194). The five false positives 
(YNC4_CAEEL, and POLG proteins from four 
potyviruses) are not found by the thioredoxin
sequence signature in either Blocks or Prints
(Table 1). Three other proteins (FIXW_RHILE,
GSBP_CHICK and RESA_BACSU) are identified as
true positives in the Prosite database. Whether or 
not they are, in fact, truly thioredoxins, they are 
variously classified by Prints and Blocks (Table 1). 
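
The counts quoted for the thioredoxin signature can be turned into the kind of per-pattern fractions discussed above. The sketch below is illustrative: the function names are hypothetical, and defining the fractions as TP/(TP + FP) and TP/(TP + FN) is an assumption made here, which may differ from the normalization used for Figure 1.

```python
def signature_performance(true_pos: int, false_pos: int, false_neg: int) -> dict:
    """Summarize how a sequence signature performs on a curated family."""
    selected = true_pos + false_pos   # sequences the pattern picks up
    members = true_pos + false_neg    # sequences that belong to the family
    return {
        # fraction of selected sequences that are true family members
        "true_positive_fraction": true_pos / selected if selected else 0.0,
        # fraction of family members that the pattern recovers
        "family_coverage": true_pos / members if members else 0.0,
    }


def specificity_bin(true_pos: int, false_pos: int) -> str:
    """Place a pattern in the three groups used in the text."""
    if false_pos == 0:
        return "specific"
    fraction = true_pos / (true_pos + false_pos)
    return "90 to 99%" if fraction >= 0.90 else "less than 90%"


if __name__ == "__main__":
    # Counts quoted in the text for the Prosite thioredoxin signature PS00194.
    stats = signature_performance(true_pos=111, false_pos=5, false_neg=1)
    print(stats, specificity_bin(111, 5))
```

With these counts the thioredoxin signature falls in the group of patterns for which 90 to 99% of the selected sequences are true positives.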

Database searches can reveal other sequences 
likely to belong to the thioredoxin family that are 
not listed in the sequence motif databases. For 
example, a keyword search of SwissProt (Bairoch
& Apweiler, 1996) via the Sequence Retrieval Sys-
tem (SRS) at EMBL (http://www.embl-heidelberg.de/srs5)
using the word "thioredoxin"
revealed seven additional sequences (Table 1) that 
were identified as being thioredoxins or probable 
thioredoxins by the depositors of these sequences. 
These sequences are variously classified by Prosite, 
Prints and Blocks (Table 1). One sequence, 
Y039_MYCTU, is not recognized by any of these 
motif databases, demonstrating that some proteins 
possibly belonging to the thioredoxin family are 
not found by any of the local sequence signature 
databases. These results point out the need to 
enhance or improve the identification of function 
from protein sequence so as to be able to comple- 
tely analyze the exponentially increasing genomic 
databases. 

Finally, experimental evidence can suggest other 
proteins that might belong to the thioredoxin 
family. YME3_THIFE is a hypothetical 9.0 kDa 
protein in the MOBE 3' region (ORF 8) in Thiobacil- 
lus ferrooxidans. A clone containing this gene is able 
to complement an E. coli thioredoxin mutant 
(Rohrer & Rawlings, 1992), thereby providing 
some experimental evidence that this hypothetical 
protein might fall into the glutaredoxin/thioredox-
in family. A BLAST search of a non-redundant



Table 1. Classification of possible thioredoxin sequences by the
Prosite, Prints and Blocks motif databases

Sequence recognized by: Prosite, Prints, Blocks

A. Sequences inconsistently classified by the three motif databases
FIXW_RHILE X X
GSBP_CHICK X X X
RESA_BACSU X X(2) a X

B. Sequences found by keyword search of SwissProt for "thioredoxin"
DSBC_HAEIN X
THIO_CHLLT X(2) a X
THIO_CHRVI X X
THIO_RHORU X
YX09_MYCTU X
Y039_MYCTU
YB59_HAEIN X

C. Sequences with some experimental evidence
YME3_THIFE b X

D. False positives found by Prosite
YNC4_CAEEL X
POLG_PVYC X
POLG_PVYN X
POLG_PVYHU X
POLG_PVYO X

Prosite, recent Prosite database online; thioredoxin examples updated
9/10/97; http://expasy.hcuge.ch/sprot/prosite.html; Bairoch et al.
(1995).

Prints, search of OWL26.0 database;
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html;
Bleasby et al. (1994).

Blocks, search of SwissProt 32; http://www.blocks.fhcrc.org; Bairoch
& Apweiler (1996).

a Prints uses three different sequence signatures to recognize the
thioredoxins; (2) means that this sequence was recognized by only two
of the three signatures.

b A plasmid in E. coli expressing this gene product complements a
thioredoxin mutant, providing experimental evidence that this protein
may be a glutaredoxin or thioredoxin (Rohrer & Rawlings, 1992).



sequence database (Genbank CDS translations,
PDB, SwissProt and PIR;
http://www.ncbi.nlm.nih.gov/BLAST/blast-databases.html) using
YME3_THIFE as the search sequence produced two 
significant matches and two sequence twilight zone 
matches. The significant matches are to a periplasmic
hydrogenase from D. vulgaris (PHFL_DESVO)
and an open reading frame (ORF-R5) from
Anabaena. One of the two twilight zone matches
is to GLRX_METTH, a glutaredoxin-like protein
from Methanobacterium thermoautotrophicum.
GLRX_METTH itself exhibits significant sequence
similarity to a number of thioredoxins. Examination of
the sequence alignment between GLRX_METTH 
and YME3_THIFE shows that the active-site 
cysteine residues are conserved. Thus, this 
hypothetical protein, YME3_THIFE, is very weakly 
similar to GLRX_METTH and probably belongs to 
the glutaredoxin/thioredoxin family. Even though 
YME3_THIFE can be identified by weak sequence 
similarity and there is experimental evidence that 
it belongs to this family, the sequence is not ident- 
ified as such by Prosite, because it contains only a 
portion of either the glutaredoxin or thioredoxin 
Prosite signatures (Figure 2). Furthermore, it is not 
found by Prints, but is classified as a glutaredoxin 
(because of the weak sequence similarity found by 
the BLAST alignment to GLRX_METTH) by Blocks 
(Table 1). These examples further illustrate the 
need to expand the ability to identify protein func-
tion from sequence, the power and utility of these
sequence signature databases notwithstanding. 

Development of a FFF for the disulfide 
oxidoreductase activity of the glutaredoxin/ 
thioredoxin protein family 

The glutaredoxin/thioredoxin protein family is 
composed of small proteins that catalyze thiol-dis- 
ulfide exchange reactions via a redox-active pair of 
cysteine residues in the active site (Yang & Wells, 
1991a,b). While glutaredoxins and thioredoxins cat-
alyze similar reactions, they are distinguished by
their differential reactivity. Glutaredoxins contain a
glutathione-binding site, are reduced by gluta-
thione (which is itself reduced by glutathione
reductase), and are essential for the glutathione- 
dependent synthesis of deoxyribonucleotides by 
ribonucleotide reductase (Holmgren & Aslund, 
1995). In contrast, thioredoxins are reduced directly
by the specific flavoprotein thioredoxin reductase
and act as more general disulfide reductases
(Holmgren & Bjornstedt, 1995). Ultimately, how-
ever, reducing equivalents for both proteins come
from NADPH. Protein disulfide isomerases (PDIs)
have been found to contain a thioredoxin-like
domain and thus have a similar activity (Kemmink
et al., 1995, 1997).

The active site of the redoxin family contains
three invariant residues: two cysteines and a cis-
proline. Mutagenesis experiments have shown that
the two cysteine residues separated by two residues
are essential for significant protein function.
The side-chains of these two residues are oxidized
and reduced during the reaction (Bushweller et al.,
1992; Yang & Wells, 1991a). These two cysteine
residues are located at the N terminus of an
α-helix. Peptide studies have suggested that the
positive end of the helix macrodipole affects the
ionization of the cysteine residues and is thus
conjectured to be important for protein function
(Kortemme & Creighton, 1995, 1996), although
alternative views have been expressed (Dyson et al.,
1997). Another unique feature of the redoxin
family is the presence of a cis-proline residue
located close to the two cysteine residues in structure,
but not in sequence. While this proline residue
is structurally conserved in all glutaredoxin
and thioredoxin structures (Katti et al., 1995) and is
invariant in aligned sequences of known glutaredoxins
and thioredoxins, its functional importance
is unknown. Other residues, particularly charged
residues, have been shown to be important for the
specific thiol characteristics of the cysteine residues,
but are not essential and can vary within the
family (Dyson et al., 1997).

[Figure 3 flowchart]

Multiple sequence alignment to identify conserved residues (may
include several different alignments from families having similar
functions)

Search literature to identify residues important for function

Examine functionally related structures

Identify residues nonlocal in sequence that are functionally important

Superimpose structures with respect to functionally important residues

Extract set of distance and conformational criteria to define the FFF

Use FFF to screen high resolution crystal and solution structures
Is FFF specific and unique? (yes / no)

Use FFF to screen low resolution structures of known
proteins created by threading and ab initio methods
Is FFF specific and unique? (yes / no)

Use FFF to screen low resolution predicted models
Is FFF specific and unique? (yes / no)

Figure 3. Outline of the protocol for producing a FFF. Information from library searches, multiple sequence align-
ments, and examination of tertiary structures are used to identify residues that are functionally important. From
there, the FFF is defined and validated first on high-resolution crystal and solution structures, then on low-to-moder-
ate resolution models of known structures, and finally on predicted models. At each of these steps, we ask whether
the FFF is unique and specific for the given activity. If it is not, the active-site residues are re-evaluated and
additional constraints are added to the FFF to make it more specific. This procedure was used to create the disulfide
oxidoreductase and T1 ribonuclease FFFs, as shown by the data presented in Table 2.

The FFF for the disulfide oxidoreductase activity
of the glutaredoxin/thioredoxin family was built
as described in Methods and outlined in Figure 3.
The literature information incorporated into the
FFF is described above. The structure of the active
site was taken from the three-dimensional structural
comparison of bacteriophage T4 glutaredoxin,
1aaz (Eklund et al., 1992), human thioredoxin, 4trx
(Forman-Kay et al., 1990) and disulfide bond
formation protein, 1dsb (Martin et al., 1993). The
superposition of the active sites of these three proteins
is shown in Figure 4A, with the α-carbon distances
between the relevant residues used to create
the FFF shown in Table 2. The set of α-carbon
distances that define the FFF and their location with
respect to the active site are indicated by dotted
lines. The following FFF was thus developed: two
cysteine residues separated by two residues and an
α-carbon distance of 5.5 (±0.5) Å. These cysteine
residues must be close to a proline residue. The
α-carbon distance from Cys(i) to the proline resi-
due is 8.5 (±1.5) Å and that from Cys(i + 3) is 6.5
(±1.5) Å. These three sets of distances comprise the
distances-only FFF of the glutaredoxin/thioredoxin
family. The definition of the FFF itself and the
specific geometric information used to create the
FFF are shown in Table 2. There is some evidence
that the cysteine residues must be at the N termi-
nus of a helix because of the effect of the helix
macrodipole on the sulfhydryl ionization
(Kortemme & Creighton, 1995, 1996); however, this
evidence is disputed (Dyson et al., 1997), so this
characteristic is applied only if necessary. 
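
The distances-only FFF defined above amounts to a simple geometric search, which can be sketched as follows. This is an illustrative reading of the stated criteria, not the authors' implementation: the input format (a list of residue names with α-carbon coordinates, assumed to be consecutive residues of one chain), the function names and the synthetic coordinates in the usage example are all assumptions, and the optional helix N-terminus requirement is not included.

```python
from math import dist  # Python 3.8+

# Distances-only FFF as stated in the text: (target, tolerance) in angstroms.
CYS_CYS = (5.5, 0.5)      # C-alpha of Cys(i) to C-alpha of Cys(i + 3)
CYS_I_PRO = (8.5, 1.5)    # C-alpha of Cys(i) to C-alpha of the proline
CYS_I3_PRO = (6.5, 1.5)   # C-alpha of Cys(i + 3) to C-alpha of the proline


def within(value, criterion):
    """True if value lies inside target +/- tolerance."""
    target, tolerance = criterion
    return abs(value - target) <= tolerance


def find_fff_sites(residues):
    """Scan one chain for the glutaredoxin/thioredoxin distances-only FFF.

    residues: list of (three-letter residue name, (x, y, z)) giving the
    C-alpha coordinate of each consecutive residue (hypothetical format).
    Returns a list of 0-based index triples (cys_i, cys_i_plus_3, pro).
    """
    hits = []
    for i in range(len(residues) - 3):
        name_a, ca_a = residues[i]
        name_b, ca_b = residues[i + 3]
        # Two cysteines separated by exactly two residues in sequence.
        if name_a != "CYS" or name_b != "CYS":
            continue
        if not within(dist(ca_a, ca_b), CYS_CYS):
            continue
        # Any proline, local or non-local in sequence, at the right distances.
        for k, (name_p, ca_p) in enumerate(residues):
            if name_p != "PRO":
                continue
            if within(dist(ca_a, ca_p), CYS_I_PRO) and within(dist(ca_b, ca_p), CYS_I3_PRO):
                hits.append((i, i + 3, k))
    return hits


if __name__ == "__main__":
    # Synthetic coordinates chosen to satisfy the criteria; not a real structure.
    toy_chain = [
        ("CYS", (0.0, 0.0, 0.0)),
        ("GLY", (2.0, 1.0, 0.0)),
        ("ALA", (4.0, 1.0, 0.0)),
        ("CYS", (5.5, 0.0, 0.0)),
        ("LYS", (9.0, 0.0, 0.0)),
        ("PRO", (5.5, 6.5, 0.0)),
    ]
    print(find_fff_sites(toy_chain))
```

Because only α-carbon positions are used, the same search can be applied unchanged to low-resolution predicted models, where side-chain coordinates are unreliable or absent.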

Application of the glutaredoxin/thioredoxin 
FFF to exact protein structures 

The distances-only FFF (Figure 4 and Table 2) is 
almost sufficient to uniquely distinguish proteins 
belonging to the glutaredoxin/thioredoxin family 
from a data set of 364 non-redundant proteins 
taken from the Brookhaven database. For this set 
of 364 proteins, 13 have the sequence signature
-C-X-X-C-. Of these, only three, 1thx (thioredoxin),
1dsbA (protein disulfide isomerase, chain A) and
1prcM (photosynthetic reaction center, chain M)
have a proline residue within the distances speci-
fied in Table 2. Of these three, only 1thx and 1dsb

GLR1_ECOLI YCVRAKDLAE KLSHERDDF. ...QYQYVDI RAEGITKE.. .DLQQKAGKP
THIO_BPT4 KCVYCDNAKR LLTVKKQPF. ...EFINIMP EKGVFDDEKI AELLTKLGRD
THIO_HUMAN PCKMIKPFFH SLSE...KYS N.VIFLEVDV D DCQDVASE
DSBA_ECOLI HCYQFEEVLH ISDNVKKKLP EGVKMTKYHV NFMGGDLGKD LTQAWAVAMA

GLR1_ECOLI VE...TVPQI FV.DQQHIGG YTDFAAWVKE NLDA
THIO_BPT4 TQ rGLTMPQV FAPDGSHICG FDQLREYFK
THIO_HUMAN CEVKCMPTFQ FFKKGQKVCE FSGA.NKEK TINELVT
DSBA_ECOLI LGVEDKVTVP LFEGVQKTQT IRSASDIRDV PINAGIKGEE YDAAWNSFW

have their cysteine residues positioned at or near
the N terminus of a helix. These two proteins are



GLRl_ECOLI 

THIO_BPT4 ' ' ' 

DSBllECO 1 ^ KSLVAQQEKA AA0VQLRGVP AMFVNGKYQL NPQGMDTSNM DVFVQQYADT 
201 

GLRl_ECOLI 

THIO_BPT4 

THIO_HUMAN 

DSBA_ECOLI VKYLSEKK 

Figure 4. A, Structure of the proteins used to describe 
the disulfide oxido reductase FFF, T4 glutaredoxin, laaz, 
chain A (Eklund et al, 1992; gray ribbon), human thiore- 
doxin, 4trx (Forman-Kay et al, 1990; blue ribbon) and 
proline disulfide isomerase, ldsb, chain A (Martin et al, 
1993; green ribbon), and an enlargement of the active 
site of these proteins. The enlargement shows that the 
active-site structure of these proteins is conserved, 
although the structure of the rest of the proteins is very 
different. The backbone atoms of the two cysteine resi- 
dues and the c/s-proline residue were superimposed and 
the side-chains of the two cysteine residues (yellow) and 
the proline residue (magenta) are shown as black ball 
and stick models. The helix containing the cysteine resi- 
dues is shown in orange and the N-terminal turn of this 
helix is cyan. The two residues on either side of the pro- 
line residues are shown as a gray ribbon. The O-C" 
distances taken from these proteins used to define the 
FFF are shown in Table 2 and are indicated in the 




the only two "true positives" in the test data sefV^ 
thereby showing that this simple FFF is quitl^pi 
specific for identifying the disulfide oxidoreductas£§^ 
active site of the glutaredoxin/ thioredoxin proteir@?^ 
family. When the requirement that the cysteihe'l^Ly- 
residues be at the N terminus of a helix is includedO|§l§i If 
then the lprc-M site is also eliminated, ^^kin^thfp^fe 
FFF absolutely specific for the glutaredoxin /thiore^^^ 1 



doxin disulfide oxidoreductase FFF. 

To explore the "fuzziness" of this acrive-sit|^fft 
descriptor, the allowed variance in the Cys-Pro9^ 
and Cys-Cys ot-carbon distances was ^niforrnl^^l 
increased in increments of ±0.1 A. Upon increasirig:^|g| 
the aLlowed distances by ±0.1 A, lfjm (Goldber^||tt 
' " 

et al, 



, (Goldberg^ 

et al, 1995), a serine/threonine phosphatase, llc#|^ 
1993), a lactoferrin, and lprc-C^ffi| 



(Day et al, 1993), a lactoferrin, and lprc-Cf^ 
(Deisenhofer et al, 1995), the C-chain of the photo^^ 
synthetic reaction center, were also selected by the^^ 
distances-only FFF. The Cys-Cys-Pro site in lfjm is^K 
curiously similar to that found in the glutaredox^^g^ 
in/ thioredoxin family, including the proline resi||^ 
due being in a ds-conformation, but the cysteme^|| 
residues are at the C terminus, not the N terminus^^ 
of a helix, llct, an iron transport protein, contains;a^^| 
proline residue near a cluster of metal-bindiri^^K 
cysteine residues and these are in a very irregiilar^^.. 
structure, not in a helix. In lprc-M, the Cys-Cy^^^. 
Pro structural motif is located along one face of a^^^ 
transmembrane helix, near the C terminus of that^^ 
helix. In lpcr-C, the Cys-Cys-Pro motif is located^^ 
in another very irregular region. Thus, even when^^ 
relaxed or fuzzy descriptors are used, all four pro-Q^ 
teins found by the distances-only FFF are elirrunatec|^S 
when the helix requirement is included. When the|||| 
distance constraints are relaxed even further; fpjg| 
±0.3 A, only one other protein, 2fd2 (Soman fl|^|| 
1991), a ferredoxin, is selected. Ferredoxin/Tfe^^l 
another metal-binding protein. Again, the cysteine:^.; 
residues are found in a non-regular stmcrural|^ 
region, not in a helix. It is important to note 
when the cysteine residues are required to be.\at|gg| 
minus of a helix, all of these false ?os^^ 



the N terminus or a neux, an or mese r-^g^ 
tives are no longer recognized, even when the F^|fg| 
distance constraints are further relaxed by ± 0 - 3 ^<p£f 
from the distances and their allowed varian( ?^^ 
shown in Table 2. 



. . ■ — — 77&y®$* 

Figure by dotted lines. B, Sequence alignment of Igj 
laaz (THIOJ3PT4), Idsb, chain A (DSBA_ECOLl) 
4trx (THIO_HUMAN), the three proteins used to cr ^|Mg 
the FFF, and lego (GLRLECOLI), the protein who|^ 
structure was predicted using the MONSSTER ^ ^ 
folding algorithm, The two cysteine residues and me c |^^ 
proline residue involved in the active site and S P^^^ 
cally selected by the FFF are shown in bold and un ?^^0 
lined. The Figure was created using Pileup (W»r^ons^ ^ 
Package, Version 8, 1994, Genetics Computer Group 
Comparison of A and B shows that the active-site 
dues in these proteins are conserved in the struc 
but are not close in the sequence. 
Residues involved in FFF   Distance in training proteins                    Mean      SD         FFF definition        Test structure
                                                                                                 FFF-DIST   FFF-Var

A. Disulfide oxidoreductase FFF of the glutaredoxins/thioredoxins

                           1aazA    1aazB    1dsbA    1dsbB    4trx                                                    1ego
Cys(i)     Pro             7.93     8.15     7.47     7.70     10.55        8.360     1.25028    8.5        1.5        7.48
Cys(i+3)   Pro             5.01     5.17     5.61     5.28     6.27         5.468     0.49932    6.5        1.5        5.60
Cys(i)     Cys(i+3)        5.27     5.21     5.13     5.55     5.54         5.350     0.18097    5.5        0.5        5.33

B. RNA hydrolytic FFF of the T1 ribonucleases

                           1rtu     1fus     1rms                                                                      9rnt
His        His*            15.20    16.70    15.97                          15.950    0.6577     15.9       1.5        15.63
His        Glu             5.36     5.84     5.71                           5.637     0.2221     [illegible] 0.5       5.79
His*       Glu             13.03    12.90    21.44                          12.580    0.2773     12.6       1.0        11.95
Tyr        Phe             13.03    16.40    16.62                          16.580    0.1290     16.5       0.5        16.43
Tyr        Arg             10.50    10.20    10.25                          [illegible] [illegible] 10.3    0.5        10.29
Phe        Arg             9.61     9.34     9.40                           9.450     0.1418     9.5        0.5        9.59
His        Tyr             4.87     5.02     5.13                           5.007     0.1305     5.0        0.5        5.07
His        Phe             14.47    15.60    15.28                          15.120    0.5866     15.2       1.0        15.28
His        Arg             10.44    11.30    10.94                          10.900    0.4366     11.0       1.0        11.16
His*       Tyr             16.06    16.10    15.86                          16.010    0.1286     15.80      1.0        15.32
His*       Phe             4.67     4.60     4.63                           4.633     0.0351     4.6        0.5        4.64
His*       Arg             8.72     8.79     8.50                           8.670     0.1513     8.6        0.5        8.48
Glu        Tyr             7.36     7.10     7.13                           7.197     0.1422     7.2        0.5        7.24
Glu        Phe             21.17    11.80    11.77                          11.900    0.2309     11.9       0.5        11.96
Glu        Arg             6.33     6.16     5.87                           6.120     0.2326     6.1        0.5        6.00


The residues in each family used to define the FFF are shown in Figures 4A and 7B. In the glutaredoxin/thioredoxin family, two cysteine residues separated by two residues and a proline residue are used. In the T1 ribonucleases, six residues are used: the nucleophilic triad, consisting of two histidine residues, His and His*, with the former (latter) closer to the N (C) terminus, and a glutamic acid residue; and the transition state stabilization triad, consisting of a tyrosine, an arginine and a large hydrophobic residue (phenylalanine in 1rtu, 1fus and 1rms). For the T1 ribonuclease FFF, the first three lines are the nucleophilic triad; the next three lines are the transition state stabilization triad; and the last nine lines are the distances between the two triads. Exact Cα-Cα distances for the relevant residues in the proteins used to define each FFF are presented (1aazA, 1aazB, 1dsbA, 1dsbB and 4trx for the glutaredoxin/thioredoxin family, and 1rtu, 1fus and 1rms for the T1 ribonuclease family). Mean is the mean of the α-carbon distances (in Å) found in these structures; SD is the standard deviation for this distribution of distances. The columns FFF-DIST and FFF-Var are the data that describe the FFF: DIST is the Cα-Cα distance and Var is the variation allowed in these interatomic distances. In most cases, DIST is close to the mean distance found in the proteins used to define the FFF (it is derived using the set of 364 PDB structures and is chosen to eliminate false positives), and Var is correlated with the standard deviation for the distribution of distances found in the same set. In the last column, the distances for (A) 1ego (Xia et al., 1992) and (B) 9rnt (Martinez-Oyanedel et al., 1991), the test proteins that were not used to define the FFF but were used to test its definition, are given for comparison.
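As a worked check of the Mean and SD columns, the following sketch recomputes them for the Cys(i)-Pro row of Table 2, part A. It assumes that SD is the sample standard deviation (n - 1 in the denominator), an assumption that happens to reproduce the tabulated value; the function name is introduced here for illustration only.

```python
from statistics import mean, stdev

# Cys(i)-Pro Cα-Cα distances (Å) from Table 2, part A:
# 1aazA, 1aazB, 1dsbA, 1dsbB, 4trx
cys_i_pro = [7.93, 8.15, 7.47, 7.70, 10.55]

def summarize(distances):
    """Return (mean, sample standard deviation) of a list of distances."""
    return mean(distances), stdev(distances)

m, sd = summarize(cys_i_pro)
print(f"mean = {m:.3f} Å, SD = {sd:.5f} Å")  # mean = 8.360 Å, SD = 1.25028 Å
```

The FFF definition for this row (8.5 ± 1.5 Å) sits close to these statistics, which is the relationship the table legend describes between DIST and the mean and between Var and the SD.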



Application of the glutaredoxin/thioredoxin 
FFF to predicted protein models produced by 
an ab initio folding algorithm 

Current ab initio protein structure prediction algorithms can often generate inexact models of proteins or protein fragments with a 3 to 6 Å backbone coordinate root-mean-square deviation (cRMSD) from the native structure (Aszodi et al., 1995; Friesner & Gunn, 1996; Mumenthaler & Braun, 1995; Ortiz et al., 1998; Smith-Brown et al., 1993; Srinivasan & Rose, 1995). Is the glutaredoxin/thioredoxin FFF sufficient to identify the active site of such an inexact model of a protein, or is a high-resolution crystal or solution structure required? The structure of E. coli glutaredoxin, 1ego (Xia et al., 1992), was predicted with a 5.7 Å cRMSD of the α-carbon atoms by the MONSSTER algorithm (Skolnick et al., 1997). Furthermore, the sequence of this glutaredoxin exhibits less than 30% sequence identity with any of the three structures used to create the FFF (Figure 4B). The disulfide oxidoreductase FFF was applied to 25 "correct" structures and 56 "incorrect" or "misfolded" structures generated by MONSSTER for the 1ego sequence during the isothermal runs. The distances-only FFF specifically selected all 25 1ego-like structures as belonging to the redoxin family and rejected all 56 misfolded structures. A set of 267 correctly and incorrectly predicted structures produced by the MONSSTER algorithm for five other proteins was then created. The distances-only glutaredoxin/thioredoxin FFF was specific for the correctly folded 1ego structures and did not recognize any of the correctly or incorrectly folded structures of these other proteins. Inclusion of the more specific criterion that the cysteine residues be at the N terminus of a helix did not change these results.

To further explore the allowed fuzziness of the FFF as applied to these inexact models, the distance constraints were again relaxed. When the allowed variance in the α-carbon distances shown in Table 2 was relaxed by an additional ±0.2 Å, the FFF was still absolutely specific for all correctly folded 1ego structures. When the variance was relaxed to ±0.3 Å, the distances-only FFF picked up two of the 56 misfolded 1ego structures, in addition to the 25 correctly folded structures. When the allowed variance was further relaxed to ±0.5 Å, no additional incorrectly folded structure was selected. These results demonstrate the specificity and the uniqueness of the glutaredoxin/thioredoxin disulfide oxidoreductase FFF for low-resolution models of protein structure produced by ab initio folding algorithms.
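The relaxation experiment can be sketched as a scan over increasing tolerances that counts how many correctly folded and misfolded models pass the FFF at each step. The model names, the decoy distances and the dictionary layout are hypothetical and serve only to show the bookkeeping; the FFF values and the 1ego distances in the first entry are taken from Table 2.

```python
# (target distance, allowed variance) in Å for Cys(i)-Cys(i+3), Cys(i)-Pro, Cys(i+3)-Pro.
FFF = [(5.5, 0.5), (8.5, 1.5), (6.5, 1.5)]

def matches(distances, extra=0.0):
    """True if all three distances fall within target ± (variance + extra)."""
    return all(abs(d - target) <= var + extra
               for d, (target, var) in zip(distances, FFF))

def relaxation_scan(models, steps=(0.0, 0.1, 0.2, 0.3, 0.5)):
    """Return {extra: (true_positives, false_positives)} for each relaxation step.

    models: dict mapping a model name to (distances, is_correct_fold).
    """
    results = {}
    for extra in steps:
        tp = sum(1 for d, ok in models.values() if ok and matches(d, extra))
        fp = sum(1 for d, ok in models.values() if not ok and matches(d, extra))
        results[extra] = (tp, fp)
    return results

# The first entry uses the 1ego distances from Table 2; the decoys are invented.
models = {
    "native_like": ((5.33, 7.48, 5.60), True),
    "decoy_a": ((6.25, 8.00, 6.00), False),  # Cys-Cys distance off by 0.75 Å
    "decoy_b": ((9.00, 12.00, 3.00), False),
}
for extra, (tp, fp) in relaxation_scan(models).items():
    print(f"+{extra:.1f} Å: {tp} correct, {fp} misfolded selected")
```

In this toy set the first decoy is picked up only once the constraints are relaxed by 0.3 Å or more, which mirrors the kind of behaviour reported for the misfolded 1ego structures.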

Application of the glutaredoxin/thioredoxin
disulfide oxidoreductase FFF to models built using
threading or inverse folding algorithms

The FFF concept can be applied to proteins that have been folded by an ab initio folding algorithm, but such state-of-the-art algorithms are too slow to permit genome-wide screening. Thus, for large-scale screening, it would be most useful if the FFFs could be applied to three-dimensional protein models produced by threading or inverse folding algorithms. To be useful for genome-wide screening, the procedure must recognize proteins that could not be detected by standard sequence analysis methods. Thus, we applied the disulfide oxidoreductase FFF to several putative proteins from the yeast genome database (Mewes et al., 1997). The selected protein sequences were aligned to a database of 301 non-homologous protein structures (Fischer et al., 1996) using an inverse folding or threading algorithm (Jaroszewski et al., 1998). Models were built using automatic scripts, as described in Methods. Without further relaxation, these models were screened using the glutaredoxin/thioredoxin FFF.

A total of eight ORFs from the S. cerevisiae genome database were tested (Table 3). Six were identified by the combination of threading and FFF as containing the disulfide oxidoreductase active site: one protein that is predicted to belong to the protein disulfide isomerase family (S67190); one sequence that the depositors identify as a hypothetical thioredoxin (YCX3_YEAST); one sequence, which has no detectable sequence similarity to any glutaredoxin or thioredoxin, identified as the gamma subunit of glycosyl transferase (OSTG_YEAST); and three hypothetical proteins, one having very distant sequence similarity to glutaredoxin from rice (S51382), one with very distant sequence similarity (insignificant by the Blast score) to the glutaredoxin from Methanococcus thermoautoformicum (S70116), and one with no similarity to any glutaredoxin or thioredoxin by Blast (YBR5_YEAST). Of these six, only YCX3_YEAST, S67190 and S70116 were identified as a glutaredoxin or thioredoxin by at least one of the motif databases. Two additional sequences, YE04_YEAST and YPR082c, were identified by the threading algorithm as having the glutaredoxin or thioredoxin structure, but were not identified by the FFF as containing the oxidoreductase active site. Both of these sequences exhibit identifiable sequence similarity to glutaredoxins or thioredoxins by the Blast algorithm (Table 3).

Table 3. Results of the application of threading and the glutaredoxin/thioredoxin disulfide oxidoreductase FFF to eight sequences from the yeast genome database

Columns: Sequence; Blast; PS; P(FS); P(B); B; Aligns; Signif.; Active-site res.; Name

Sequence (with method marks as extracted):
YCX3_YEAST X X X X X
S67190
S70116
S51382
X X X X X
YBR5_YEAST
OSTG_YEAST
YE04_YEAST X
YPR082c X

Aligns (as extracted): 2trxA, 2trxA, 2trxA; 2trxA, 2trxA, 2trxA; 1ego, 2trxA, 2trxA; 1ego, 2trxA, 1dsbA; 2trxA, 2trxA, 2trxA; 2trxA, 2trxA, 2trxA; 2trxA, 2trxA, 2trxA

Signif. (as extracted): 48.1, 498.3, 101.2; 5464.9, 1736.8, 2200.6; 7.6, 17.6, 135; 7.8, 175, 16.0; 29.0; 25.6, 437, 95.8; 175.6, 934.3, 730.3; 8.9, 23.1, 23.3

Sequence     Active-site res.   Name
YCX3_YEAST   C55,C58,P98        Hypothetical thrx-like protein
S67190       C59,C62,P105       MPD1 prot
S70116       C31,C34,P79        Hypothetical protein
S51382       C25,C28,P74        Hypothetical protein
YBR5_YEAST   C13,C16,P151       Hypothetical protein
OSTG_YEAST   C73,C76,P133       Glycosyl transferase γ subunit
YE04_YEAST   NF, NF, NF         Hypothetical protein
YPR082c      NF, NF, NF         Hypothetical protein

The first five columns show if the function of the sequence could have been [...] method; [...] glutaredoxin/thioredoxin FFF as being active-site residues. NF means that, although the sequence aligned with 1ego, [...] 1dsbA, the FFF did not identify a disulfide oxidoreductase active site in this protein.

The threading algorithm (Jaroszewski et al., 1998) aligns the sequences of all eight of these ORFs to the structure of either 1ego (E. coli glutaredoxin (Xia et al., 1992)), 2trx, chain A (E. coli thioredoxin (Katti et al., 1990)), or 1dsb, chain A (E. coli protein disulfide isomerase (Martin et al., 1993)) from a database of 301 non-homologous proteins (Fischer et al., 1996). The alignment fit is strong, as seven of the eight sequences were matched to 1ego, 2trx or 1dsb by all three scoring methods used to assess the significance of the threading results (Table 3). The significance scores reported in Table 3 can be compared to the distribution of significance scores for all yeast sequences that aligned to 2trx, chain A (Figure 5). Models were built based on the sequence-to-structure alignments and were screened with the FFF. Sixteen of the models (one model for each scoring method for YCX3_YEAST, S67190, S70116, S51382, YBR5_YEAST and OSTG_YEAST) were found to have the disulfide oxidoreductase active site described by the distances-only FFF. The residues predicted to be in the active sites of these proteins are listed in Table 3.



This result is remarkable when one considers that the sequence similarity between these proteins is virtually non-existent. Several examples are shown in Figure 6. In one case (S70116), standard multiple sequence alignments even fail to correctly align the proposed active-site residues when the sequences are aligned with each other (Figure 6B). The Prosite, Prints and Blocks motif databases variously classify these proteins (Table 3). One sequence (S70116) is recognized only by the Blocks database; another sequence (S51382) is not recognized by any of these sequence motif databases. Furthermore, BLAST sequence alignment algorithms do not find a match for YBR5_YEAST or OSTG_YEAST to any glutaredoxins or thioredoxins. Thus, the result for these two sequences stands as a prediction of activity based on the disulfide oxidoreductase glutaredoxin/thioredoxin FFF.

Figure 5. The distribution of significance scores for all sequences from the yeast genome that aligned with 2trx, chain A, by the threading algorithm. The main graph shows the distribution for significance scores of 1 to 200; the inset graph shows the distribution for all significance scores. Comparison of scores presented in Table 3 and this histogram shows that sequences found to contain the disulfide oxidoreductase active site and those that do not contain the active site have similar significance scores. Thus, application of the FFF allows us to automatically distinguish between proteins with similar active sites and those that are topological cousins. The threading algorithm and three different scoring methods (filled, open and cross-hatched bars) are described in detail by Jaroszewski et al. (1998) and briefly described in Methods.

A
            1                                                     50
GLR1_ECOLI  KQTVIFGRSG CPYCVRA.KD LAEKLSNE..
S51382      MSAP VTKAEEMIKS HPYFQLSASW CPDCVYANS. IWNKLNV...
S70116      MLRAFRCSIH TSRVLLHDAG VKLTFFSKPN CCLCDQAKEV IDDVFERKEF

            51                                                    100
GLR1_ECOLI  RDD...FQYQ YVDIRAEGIT KEDLQQKAGK PVETVPQIPV D..QQHIGGY
S51382      QDKVFVFDIG SLPRNEQEKW RIAFQKWGS R..NLPTIW N..GKFWGTE
S70116      HHKAVSLEIV NITDRRNAKW WKEY CFDIPVLHI EKVCDPKSCT

            101                     123
GLR1_ECOLI  TDFAAWVKEN LDA
S51382      SQLHRFEAKG TLEEELTKIG LLP
S70116      KILHFLEEDD ISDKIRRMQS R..

B
            1                                                     50
THIO_ECOLI  SDK IIHLTDDSFD TDVLKADGAI
YCX3_YEAST  .MLFYKPVMR MAVRPLKSIR FQSSYTSITK LTNLTEF... RNLIKQNDKL
S67190      MLFLNIIKLL LGLFIMMEVK AQNFYDSDPH ISELTPKSFD KAIHNTNYTS
S51382      MSAFVT KAEEMIKSHP
S70116      M LRAFRCSIHT SRVLLHDAGV

            51                                                    100
THIO_ECOLI  LVDFWA EWCGPCK.MI APILDEI.AD EYQGKLTVAK LNIDQNPG..
YCX3_YEAST  VIDFYA TWCGPCK.MM QPHLTKL.IQ AYPD...VRF VKCDVDES..
S67190      LVEFYA PWCGHCK.KL SSTFRKA.AK RLDGWQVAA VNCDLHKN..
S51382      YFQLSA SWCPDCV.YA NSIWNKLNVQ DKVFVFDIGS LPRNEQEKWR
S70116      KLTFFSKPNC GLCDQAKEVI DDVFERKEFH NKAVSLEIVN ITDRRNAKWW

            101                                                   150
THIO_ECOLI  TA..PKYGIR GIPTLLL... FKNGEVAATK VGALSKGQLK EFLDANLA..
YCX3_YEAST  PDIAKECEVT AMPTFVL... GKDGQLIGKI IGANPTALEK GIKDL
S67190      KALCAKYDVN GFPTLMV... FRPPKIDLSK PIDNAKKSFS AHANEVYSGA
S51382      IAFQKWGSR NLPTIWNGK FWGTESQLHR FEAKGTLEEE LTKIGLLP..
S70116      KEYC..FDIP VLHIEKVGDP KSCTKILHFL EEDDISDKIR RMQSR

Figure 6. Sequence alignment of four proteins from the Saccharomyces cerevisiae genome with each other and with the target structures 1ego (Xia et al., 1992; GLR1_ECOLI) and 2trx (Katti et al., 1990; THIO_ECOLI); see Table 3. The threading program matched S51382 and S70116, both hypothetical proteins, to 1ego (A); the threading program matched YCX3_YEAST, a hypothetical thioredoxin-like protein, S67190, a protein that is predicted to be related to the protein disulfide isomerases, S51382 and S70116 to 2trx (B). (Different scoring functions matched S51382 and S70116 to different templates, thus they are shown in both alignments; see Table 3.) In B, only the first 150 residues of S67190 that align with 2trx are shown. The cysteine and proline residues identified by the FFF as being part of the disulfide oxidoreductase active site (see Figure 4A and Table 2) are shown in bold type. Alignments were produced by the Pileup multiple sequence alignment program (Wisconsin Package, Version 8, 1994, Genetics Computer Group) using the standard parameters.

Finally, it must be shown that not all sequences recognized by the threading algorithm contain the disulfide oxidoreductase active site. In theory, threading algorithms should recognize structure only; consequently, they should be able to recognize proteins that have similar structures but do not have the same function. Such proteins are termed topological cousins. The FFF should allow us to distinguish between functionally related proteins and topological cousins. Two of the sequences (YE04_YEAST and YPR082c) are found to align with 2trx, chain A, by all three scoring methods used by the threading algorithm (Table 3). Comparison of the scores reported in Table 3 to the distribution of scores shown in Figure 5 demonstrates that these are relatively strong structural predictions. Furthermore, both of these putative proteins are found to align to thioredoxins by the BLAST algorithm (Table 3), although the significance score of these sequence alignments is not high. However, the disulfide oxidoreductase FFF does not find the correct active site in the three-dimensional models. Analysis of the sequences by hand after the fact demonstrates that neither sequence has the CXXC sequence characteristic of the active site of this family; thus it is unlikely that these proteins would demonstrate the oxidoreductase activity. We have thus shown that not all proteins that align with 1ego, 2trx or 1dsb by the threading algorithm exhibit the disulfide oxidoreductase active site.
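The after-the-fact sequence check mentioned above (presence or absence of a CXXC signature) was done by hand in the study; a minimal automated version might look like the following. The function name is an assumption, the first test string is the fragment around the active site of THIO_ECOLI as it appears in Figure 6B with gap characters removed, and the second is an invented variant with the cysteine residues replaced.

```python
import re

# Two cysteine residues separated by exactly two residues of any type.
CXXC = re.compile(r"C..C")

def has_cxxc(sequence):
    """True if the ungapped one-letter sequence contains a C-X-X-C signature."""
    ungapped = sequence.upper().replace(".", "").replace("-", "")
    return CXXC.search(ungapped) is not None

print(has_cxxc("LVDFWAEWCGPCKMIAPILDEI"))  # True
print(has_cxxc("LVDFWAEWSGPSKMIAPILDEI"))  # False
```

A sequence that lacks this signature cannot satisfy the Cys(i)/Cys(i+3) part of the FFF in any model, which is consistent with the FFF rejecting the two topological cousins.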

Taken together, these results demonstrate that models produced by threading algorithms are sufficient for application of the FFF to the identification of active sites in proteins. Application of the FFF idea is an automatic method for distinguishing between functionally related proteins and topological cousins. These results suggest a means for large-scale functional analysis of the genome databases using the sequence-to-structure-to-function paradigm.

Development of a FFF for the T1 ribonuclease
protein family

To show that the FFF concept is applicable to other active sites besides the glutaredoxin/thioredoxin disulfide oxidoreductase active site, a FFF was built for the active site of the T1 ribonuclease family. The T1 ribonucleases are a family of proteins that include a number of ribonucleases such as T1, T2, U2 and F1, and the distantly related family of fungal ribotoxins. These proteins are endoribonucleases that are generally specific for purine, particularly guanine, bases (Steyaert, 1997).

The catalytic mechanism of the T1 ribonucleases has been well studied (for a review, see Steyaert, 1997). Two histidines and a glutamic acid residue are essential for the nucleophilic displacement of the phosphate atoms. A tyrosine, a phenylalanine (or another large hydrophobic residue) and an arginine residue are responsible for stabilizing the transition state of the reaction. These catalytic residues are located on various strands across one face of a β-sheet. They are highlighted in the multiple sequence alignment of T1 ribonucleases shown in Figure 7A, and their proximity in three-dimensional space is shown in Figure 7B. Neither Prosite, Prints nor Blocks provides a sequence signature with which to identify this family.

An analysis of three T1 ribonucleases whose structures have been solved (1rms, Nonaka et al., 1993; 1fus, Vassylyev et al., 1993; and 1rtu, Noguchi et al., 1995) shows that the location of the active-site residues in three-dimensional space is very well conserved. Thus, a FFF based on the distances between appropriate α-carbon atoms was developed from these distances, plus or minus a small variance. The exact values used to create the FFF and the distances and variances for the FFF itself are given in Table 2.
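A six-residue, distance-based FFF of this kind can be searched for in stages: find each residue triad first, then check the distances between the two triads. The sketch below is one possible way to organize such a search and is not taken from the paper; the function names, the input format and all coordinates and constraint values in the toy example are hypothetical (the real constraints are the FFF-DIST and FFF-Var columns of Table 2, part B).

```python
import math
from itertools import product

def triads(residues, types, constraints, extra=0.0):
    """Yield index triples whose residue types and pairwise Cα distances match.

    residues: list of (one_letter_code, (x, y, z)) α-carbon records.
    types: three collections of allowed one-letter codes, one per triad position.
    constraints: {(p, q): (distance, variance)} for triad positions p and q.
    """
    candidates = [[i for i, (aa, _) in enumerate(residues) if aa in allowed]
                  for allowed in types]
    for combo in product(*candidates):
        if len(set(combo)) < 3:
            continue  # one residue cannot fill two positions
        if all(abs(math.dist(residues[combo[p]][1], residues[combo[q]][1]) - d)
               <= var + extra for (p, q), (d, var) in constraints.items()):
            yield combo

def staged_search(residues, stage1, stage2, cross, extra=0.0):
    """Stage 1: first triad; stage 2: second triad; stage 3: inter-triad distances."""
    first = list(triads(residues, *stage1, extra))
    if not first:
        return []
    second = list(triads(residues, *stage2, extra))
    hits = []
    for t1 in first:
        for t2 in second:
            if set(t1) & set(t2):
                continue
            if all(abs(math.dist(residues[t1[p]][1], residues[t2[q]][1]) - d)
                   <= var + extra for (p, q), (d, var) in cross.items()):
                hits.append((t1, t2))
    return hits

# Toy residues on a line and toy constraints (hypothetical values, Å).
toy = [("H", (0, 0, 0)), ("E", (5, 0, 0)), ("H", (16, 0, 0)),
       ("Y", (20, 0, 0)), ("F", (30, 0, 0)), ("R", (40, 0, 0)), ("A", (100, 0, 0))]
nucleophilic = (("H", "H", "E"), {(0, 1): (16, 1), (0, 2): (5, 0.5), (1, 2): (11, 1)})
stabilizing = (("Y", "FLIMVW", "R"), {(0, 1): (10, 1), (0, 2): (20, 1), (1, 2): (10, 1)})
between = {(0, 0): (20, 1), (2, 2): (35, 1)}  # His-Tyr and Glu-Arg
print(staged_search(toy, nucleophilic, stabilizing, between))  # [((0, 2, 1), (3, 4, 5))]
```

Searching for the triads separately keeps the number of six-residue combinations that must be tested small, since most structures fail at the first or second stage.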




Application of the T, ribonuclease FFF to 
"exact" protein structures 

When applied to three-dimensional stmctu S$^|l 
the T, ribonuclease FFF was implemented in 
stages: first the structure is searched for the res £^i^ i 
triad involved in nucleophilic displacement 'ffj^^^ 
His-Glu); second, the structure is searched ^^^f 
residue triad involved in transition state stabil^|^ 
arion (Tyr-Hydrophobic-Arg); third, if both tnaOS^^ 
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1 50 

RNTl_ASPOR HHYSKLL TLTTLLLPTA LA L PS L VERA CDYTCGSHCY 

RNPl_FUSHO Q SATTCGSTNY 

RNMS_ASPSA ES CEYTCG0TCY 

RNU2_USTSP CHIP ESTNCGGNVY 

RMC2_ASPCL D CDYTCGSHCY 

RNPB_PENBR A CAATCGTVCY 

RNPC_PENCH A CAATCGSVCY 

RNN1_NEUCR .' A CMYICGSVCY 

RNU1_USTSP QGG VSVNCGGTYY 

RNAS_ASPGI MVAIKNLVLV ALTAVTALAV PSPLEARAVT WTCLNDQKNP KTNKYETKRL 
RNCL_ASPCL HVAIKKLVLV ALTAVTALAM PSPLEERAAT WTCMNEQKNP KTNKYENKRL 
RNMG_ASPRE MVAIKNLFLL AATAV5VLAA PSPLDARA.T WTCINQQLNP KTKKWEDKRL 



RNTl_ASPOR 
RNF1_FUSH0 
RNMS_ASPSA 
RNU2_USTSP 
RNC2_ASPCL 
RNPB_PENBR 
RNPC_PENCH 
RNN1_NEUCR 
RNU1_USTSP 
RNAS_ASPCI 
RNCL_ASPCL 
RNMG.ASPRE 



RNTl_ASPOR 
RNFl_FUSMO 
RNMS.ASPSA 
RNU2_USTSP 
RNC2_ASPCL 
RNPB_PENBR 
RNPC_PENCH 
RNN1_NEUCR 
RNU1_USTSP 
RNAS_ASPGI 
RNCL_ASPCL 
RNMC_ASPRE 



SI 

SSSDVSTAQA 
SASQVRAAAN 
WSSDVSAAKA 
SNDDIWTAIQ 
SASAVSDAQS 
TS5AISSAQA 
TSSAISAAQB 
SSSAISAALN 
SSTOVNRAIN 
LYNQNXAESN 
LYNQNNAESN 
LYSQAKAESN 



AGYQLHEOGE 
AACQYYQHOD 
KGYSLYESGD 
GA . . . LDDVA 
AGYQLESAGQ 
AGYNLYSTND 
AGYDLYSAND 
KGYSYYEDGA 

NA KSG 

SHHAPLSDGK 
AHHAPLSDGK 
SHHAPLSDGK 



TVGSNSTPHK 
SAGSTTTPBT 
TI . . DDYPBG 
RPDGDNTPKQ 
SVGRSRTPSQ 
DV . . SHTPBE 
OV. .SHTPBE 
TAGSSSTPHR 
QYSSTGTPHT 
T. . GSSTPHW 
T. .GSSTPHW 
T. .GSSTPHW 



100 

YNN . YEG PDF 

YNN.YEG FDP 

YHD.YEG FDF 

YYD.EAS EOI 

YRN.YEG FHF 

YHN , YEG FDF 

YRN.YEG PDF 

YNN.YEG FDF 

YNN.YEG FDF 

FTNGYDGDGK LPKGRTPIKF 
FTNGYDGDGK ILKCRTPIKW 
FTNGYDGNGK LIKGRTPIKF 



101 

. .S.VSSP YYWPILS 

. .P.VDGP YQEFPXKS 

. .P.VSCT YYMYPIKS 

TLCCG PCS WSHFPLVY 

. .P.VSGN YYEWPILS 

. . P.7SGT YYBFPILK 

. .P.VSGT YYBFP1LR 

. .P.TAKP WYBFPILS 

S . DYCDGP YKKYPLKT 

GKSDCDRPPK HSKDGNGKTD HYLLKFPTFP 
GNSDCDRPPK HSKNGDGKND HYtXKFPTFP 
GKADCDRPPK HSQNGKGKDD HYLLBFPTFP 



SGDVYS. .G. 
GG.VYT. -G. 
DYDVYT. .G. 
NGPYYS. . SR 
SGSTYN. .G. 
SGKVYT. .G. 
SGAVYS. .G. 
SGRVYT. -G. 
SSSGYT. .G. 
DGHDYKFDSK 
DGHQYNFDSK 
DGHDYKFDSK 



150 
. . .GSPGADR 
. . . GSPGADR 
. . .GSPGADR 
DNYVSPGPDR 
. . . GGPGADR 
. . . SSPGADR 
. . . NSPGADX 
. . .GSPGADR 
. . .GSPGADR 
KPKENPGPAR 
KPKEDPGPAR 
KPKEDPGPAR 



RNT1. 
RNF1. 
RHMS. 
RNU2. 
RMC2. 
RNPB. 
RNPC_ 
RNN1. 
RNUl. 
RNAS. 
RNCL. 
RNHG. 



ASPOR 
FUSMO 
.AS PSA 
.USTSP 
ASPCL 
.PENBR 
.PENCH 
NEUCR 
.USTSP 
ASPGI 
.ASPCL 
.ASPRE 



151 

WFNENNQ . L 
WINTNCE.Y 
VIFNGDDB.L 
V1YQTNTGEF 
WFNDNDE . L 
VIFNDDDE.L 
WFNGNDQ . L 
VIFDSHGN.L 
WYDS NDGTP 
VIYTYPNKVF 
VIYTYPNKVF 
VIYTYPNKVP 



AGVITHTGAS 
AGAITHTGAS 
AGVITHTGAS 
CATVTHTGAA 
AGLITHTGAS 
AGVITHTGAS 
AGVITHTGAS 
DHLITHNGAS 
CGAITHTGAS 
CGIIAHTKEN 
CGIVAHTREN 
CGIVAHQRGN 



182 

G.NNFVECT. .. 
G . NNTVGCSG TN 
G.DDFVACSS S 
SYDGFTCCS. 
G.DGTVACY. 
G . NNFVACT . 
G.NNFVACD. 
G . NNFVACN . 
G . NNrVQCSY 
Q.GELKLCSH 
Q . GDLKLCSH 
Q.GDLRLCSH 




Figure 7. A, Sequence alignment of nine ribonucleases 
and three ribotoxins in the T, family. The first nine 
sequences are ribonucleases; the last three are ribotoxins 
with their leader sequences attached (RNAS_ ASPGI, 
ot-sarcin; RNCL.ASPCL, clavin; RNMG_ASPRE, restric- 
tocin). The six residues involved in the catalytic mechan- 
ism are shown in bold. The first four sequences are 
those found in the PDB database and were used to cre- 
ate or to test the FFF (see Tables 2 and 4). The other T, 
ribonucleases were found by searching the SwissProt32 
database (Bairoch & Apweiler, 1996) with BLAST 



i 



are found, the relative positions of the two triads 
are checked based only on the distances between 
a-carbon atoms. Application of the FFF to the 364 
non-homologous PDB protein structures yields 
only one structure that contains both residue triads 
in the correct juxtapositions: 9rnt (Matinez-Oyane- 
del et al, 1991), the only true positive in the test 
data set. Increasing the allowed variation for each 
distance by ±0.5 A yields no additional protein, 
demonstrating that this FFF is specific for struc- 
tures of the 1 X ribonuclease family solved to atomic 
resolution, even when the distance restraints are 
made increasingly fuzzy. 

Application of the J A ribonuclease FFF on 
low-to-moderate resolution protein models 

To test the applicability of the T, ribonuclease 
FFF to inexact, predicted models, the nine ribonu- 
clease sequences listed in Table 4 and Figure 7 A 
were threaded through 301 non-homologous pro- 
teins, as described in Methods. All nine sequences 
were matched as the highest score to the 9rnt struc- 
ture by all three scoring methods. Models were 
built for all 27 (nine sequences times three scoring 
methods) sequence-to-structure alignments and all 
27 models were screened by the T l ribonuclease 
FFF. All 27 models were found to contain both 
ribonuclease active-site triads in the correct 
locations in the structure (Table 3). 

To test the method on more distantly related 
sequences, models of three ribotoxin sequences 
were built. Ribotoxins are a small family of pro- 
teins found in the Aspergillus fungi family. They 
cleave RNA in the ribosome, thus inactivating the 
ribosome and ultimately killing the cell (Kao & 
Davies, 1995). The RNA cleavage is carried out 
by a mechanism quite similar to that found in the 
T, ribonucleases (Campos-Olivas et al., 1996). The 



(Altschul et al., 1990) using RNT1_ASPOR as the search
sequence. The ribonucleases were selected as the most
significant matches to RNT1_ASPOR. All sequences,
both ribonucleases and ribotoxins, were aligned using
the Pileup multiple sequence alignment tool from the
Wisconsin GCG package. B, A view looking along the
approximate plane of the β-sheet that contains the active
site of 9rnt (Martinez-Oyanedel et al., 1991). The
His-His-Glu involved in nucleophilic displacement are
shown as magenta ball and stick models; the
Tyr-Hydrophobic-Arg side-chains involved in transition
state stabilization are shown as light green ball and
stick models. The strands of the sheet (as identified by
the crystallographer) are shown as a light blue ribbon.
The set of α-carbon distances that define the FFF
corresponding to the His-His-Glu distances are indicated by
the magenta lines and the corresponding distances for
those residues involved in transition state stabilization
are shown by green lines. Comparison of A and B
shows that the active-site residues are close in
three-dimensional space, but not close in the one-dimensional
sequence.
Table 4. Results of application of threading and the T1 ribonuclease FFF to nine ribonuclease
sequences from various organisms and three ribotoxin sequences from the Aspergillus fungi family

Sequence                     Signif.                 Active-site res. (H H E Y Phob R)

RNT1_ASPOR (9rnt)            132104, 645277, 4334    40  92  58  38 100  77
RNF1_FUSMO (1fus)            2358, 6673, 428         30  90  57  37  98  75
RNMS_ASPSA (1rms)            12586, 38926, 1244      39  91  57  37  99  76
RNU2_USTSP (1rtu)            122, 332, 61.0          41 101  62  39 110  85
RNC2_ASPCL                   18638, 88768, 1321      40  92  58  38 100  77
RNPB_PENBR                   6650, 26930, 810        38  90  56  36  98  75
RNPC_PENCH                   5977, 21676, 764        38  90  56  36  98  75
RNN1_NEUCR                   11840, 47527, 946       40  92  58  38 100  77
RNU1_USTSP                   677, 1690, 132          37  92  57  35 100  76
RNAS_ASPGI (α-sarcin)        26.4, 72.5, 14.8        77 164 123  75 172 148
RNCL_ASPCL (clavin)          26.1, 82.2, 13.3        77 164 123  75 172 148
RNMG_ASPRE (restrictocin)    21.1, 91.2, 15.8        76 163 122  74 171 147

Signif. is the significance score of the alignment. The first entry of the group of three is the sequence-based
scoring method; the second entry is the sequence-structure scoring method; the third is the structure-structure
scoring method. All sequences aligned with the 9rnt structure. Active-site res. are the residues in the threading
model that are identified by the RNA hydrolysis T1 ribonuclease FFF as being active-site residues. The first
group (H H E) is the triad involved in the nucleophilic catalysis; the second group (Y Phob R) is involved in
transition state stabilization (see the text for more details). For the ribotoxins, the residue numbering includes
the leader sequence, as shown in Figure 7A.
three selected ribotoxins, α-sarcin (RNAS_ASPGI),
clavin (RNCL_ASPCL) and restrictocin (mitogillin,
RNMG_ASPRE), can be aligned with the T1 ribonucleases
by multiple sequence alignment algorithms
(Figure 7A), but the degree of sequence
identity between the ribotoxins and the T1 ribonucleases
is quite low (less than 35% pairwise
sequence identity). Furthermore, a Blast (Altschul
et al., 1990) search of SwissProt (Bairoch &
Apweiler, 1996) using the sequence of 9rnt as the
search sequence does not yield any of these ribotoxin
sequences. The structure of restrictocin (Yang
& Moffat, 1996) was solved and recently released;
that of α-sarcin (Campos-Olivas et al., 1996) was
solved, but has not yet been released to the public
databases.

The three ribotoxin sequences, including their
signal sequences, were threaded through 301
non-homologous protein structures (Fischer et al., 1996).
As with the T1 ribonucleases, each ribotoxin
sequence aligned with 9rnt as the highest-scoring
structure by all three scoring methods, although
the alignment scores were much lower than those
for the T1 ribonucleases themselves (Table 4). Nine
models (three sequences times three scoring
methods) were built based on the sequence-to-structure
alignments produced by the threading
program. All nine models contained both the
nucleophilic and the transition state stabilization
triads and were recognized by the T1 ribonuclease
FFF. The identified active-site residues are presented
in Table 4. This result again demonstrates
that models of distantly related proteins can be
built on sequence-to-structure alignments produced
by a threading algorithm, and that active sites
within these low-to-moderate resolution models
can be recognized by the FFF.

None of the T1 ribonuclease sequences has yet
been folded by the MONSSTER algorithm; however,
as a control, this FFF was tested on a database
of correctly and incorrectly folded structures
produced by MONSSTER and used to test the
glutaredoxin/thioredoxin FFF. None of these structures
was found to contain the RNase active
site, even when the allowed variance in the distance
was uniformly increased by ±0.5 Å over
those variances shown in Table 2.



Discussion 

With the advent of the genome sequencing projects,
the number of known protein sequences is
exponentially increasing; however, the sequence of
a protein is virtually useless without some knowledge
of both its structure and its function. The
most common methods for predicting protein function
from sequence are to look for homologous
proteins in the sequence databases by standard
sequence alignment protocols, or to look for local
sequence signatures that match those found in
appropriate functional databases such as Prosite,
Blocks and Prints.

Here, we have demonstrated the utility of a novel
method for predicting protein function based on
the three-dimensional structure of the active site.
This method is based on the sequence-to-structure-to-function
paradigm, because the structure of the
protein is first predicted from its sequence, then
the active site of the protein is identified in the
predicted model. We have shown here that active-site
descriptors, termed FFFs, work to identify the
active-site residues both in high-resolution (exact)
and low-to-moderate resolution (inexact or
predicted) protein structures.

Advantages of using geometric descriptors to
identify protein active sites



Because the method is based on the three-dimensional
structures of protein active sites, it has the
following advantages (each is discussed in further
detail in the following paragraphs): (1) it is appli- 
cable even when the degree of sequence identity 
between two proteins is not significant; (2) it can, 
in principle, treat the case of proteins having two 
different global folds, but similar sites and associ- 
ated function; (3) it distinguishes between proteins 
with similar folds (topological cousins) and those 
that belong to a given functional family; and (4) in 
addition to assigning a given protein to a func- 
tional family, the method produces a map or 
model of the protein active site. 

The examples presented in the Introduction 
suggest that functionally important residues are 
often non-local in sequence and important func- 
tional relationships cannot always be extracted 
from standard sequence comparisons. As the 
sequence and structural databases grow ever larger,
examples such as these will become increasingly
common. Motif databases such as
Prosite (Bairoch et al., 1995), Blocks (Henikoff &
Henikoff, 1991) and Prints (Attwood & Beck, 1994;
Attwood et al., 1994, 1997), while very powerful,
are limited in scope because they are restricted to
one-dimensional sequence information. FFFs,
because they are based on the three-dimensional
structure of the active site, should be able to identify
the similar function in these families, as was
shown here for the oxidoreductase activity of the
glutaredoxin/thioredoxin family (Table 3) and the
RNA hydrolytic activity of the T1 ribonucleases
(Table 4). We have shown several cases where the
FFF was able to identify an active site when the 
sequence identity between the two sequences was 
in the twilight zone. We have predicted the active 
site and associated activity in one case where the 
sequence identity is insignificant and the function 
is not identified by Prosite, Prints or Blocks. 
Finally, we predicted the activity in two cases where 
neither BLAST nor the motif databases predict the 
glutaredoxin/thioredoxin active site (Table 3). 

In the extreme case, however, two proteins 
might have similar active sites even though their 
tertiary structures are completely different, as 
found for the mammalian and bacterial serine pro- 
teases (Branden & Tooze, 1991). In such a case, 
sequence alignment or local sequence signatures
would be unable to recognize the functional similarity.
The method presented here has the advantage
that it should be able to recognize the active-site
similarity in such cases. Such cases will be
examined in the future. 

The third advantage of the method is that it can 
distinguish between topological cousins and pro- 
teins having similar function. The number of 
known folds is not increasing as quickly as the 
number of solved structures. For instance, the 
structural family of the α/β barrel proteins is quite
large; however, the members of this family can 
have quite disparate functions. This has led some 
researchers to suggest that there are a limited num- 
ber of protein folds (Godzik, 1997; Holm & Sander, 
1997a; Orengo et al., 1994; Wang, 1996), a statement
that bodes well for prediction of protein structure
by threading or inverse folding-based approaches. 
However, while this observation might increase 
the chances of predicting a structure via threading, 
it decreases our ability to predict the function of 
that same protein via threading. If many different 
proteins with differing functions fold into similar 
structures, simple structure prediction will tell us 
nothing whatsoever about the function of the protein.
The results for YE04_YEAST and YPR082c
demonstrate that the FFF can provide a method for
automatic distinction between functionally related
proteins and topological cousins. Thus, a library of FFFs
would greatly expand the utility of current thread- 
ing algorithms and allow us to predict protein 
function, as well as structure, via threading 
approaches. 

Use of the sequence-to-structure-to-function 
paradigm for prediction of protein function confers 
one further set of advantages: the predicted struc- 
ture produces a model of the protein and the FFF 
identifies the exact location of the active site in 
both the sequence and its predicted structure. In 
contrast, standard sequence analysis methods, 
being inherently one-dimensional, do not automati- 
cally provide a model of the active site. Once the 
protein's function is predicted by sequence hom- 
ology, a model of the active site can be built by 
homology modeling based on the sequence align- 
ment, provided that the structure of a related pro- 
tein is known. In the FFF paradigm, the procedure 
is inverted: first, a model structure is built, then it 
is scanned for possible location of the active site. 
Once the location of the active site is identified, the 
active site and the model can be further studied for 
possible similarities to or differences from other 
active sites of the same protein family. For example, 
once a predicted structure is identified as having the 
redoxin disulfide oxidoreductase activity, the struc- 
ture can be analyzed in more detail to see if it 
belongs to the thioredoxin or the glutaredoxin sub- 
families or to another, as yet unidentified, subfamily. 
Examples of active-site residues predicted by the 
FFFs described here are presented in Tables 3 and 4. 
Finally, the sequence whose function is newly 
assigned can now be used to scan the sequence data- 
bases for other homologous sequences that are com- 
patible with the predicted function. 

Disadvantages of using geometric descriptors 
for identifying protein active sites 

The FFF approach suffers from several disadvan- 
tages. First, a structure of the protein must be 
available. This is not as great a disadvantage as it 
might seem, because we have shown here that 
low-resolution models produced by current predic- 
tion algorithms are still useful for active-site pre- 
diction using this method. Protein structure 
prediction tools and algorithms will only improve, 
but even at this stage, useful models can still be 
produced. A second disadvantage is that the
resulting model might actually be incorrect, i.e. it
is misfolded, either globally or locally. Such an
incorrect structure could cause misidentification of
an active site, with either a false positive or a false
negative result, depending on the particular case.
As the method is further tested, such situations
will undoubtedly be observed. The final and major
disadvantage is that the active site responsible for
the protein's function must have been previously
observed and studied. Otherwise, using the current
method, it is not possible to build an FFF. However,
we will start by building a library of FFFs with the
many active sites that have been well studied and 
whose structures are known. Even if our approach 
is limited to previously identified active sites, the 
ability to predict proteins that have these active 
sites will still be very useful and will extend the 
limits of function prediction from sequence much 
further into the twilight zone. 

Conclusions and Future Directions 

Here, we have shown that simple, relaxed geo- 
metric and conformational descriptors (FFFs) of the 
active sites of proteins are sufficient to select pro- 
teins containing specific activities from a large set 
of high-resolution models. We have shown that the 
two developed FFFs are specific for low-resolution 
models created either by ab initio folding algor- 
ithms or by threading algorithms. Finally, we have 
presented an example where the FFF predicts the 
function of two proteins from the yeast genome 
whose structures were predicted by a threading 
algorithm and whose function could not be ident- 
ified by local signature databases or by Blast. This 
work increases the utility of the genomic sequence 
databases and demonstrates that predicted models, 
even those at low resolution, can be used for pro- 
tein function prediction. Thus, it paves the way for 
the functional screening of the genomic databases 
based on the sequence-to-structure-to-function 
paradigm, provided that the active-site geometry 
and conformation are found in a previously solved
structure. 

In the future, we plan to expand the library of 
FFFs to include many more active sites, focusing 
on those activities that are not easily identifiable 
by local sequence motifs. We will include the pep- 
tide hydrolytic active site of the serine proteases in 
the expanded FFF library to show that FFFs can 
identify similar active sites even when the global 
fold is completely different. The threading results 
presented here suggest that screening of complete 
genomes for function might be feasible, and this, 
too, will be attempted. Analysis of complete gen- 
omes will provide more detailed analysis on the 
success and failure rate of this method. 

Methods

Description of how to build an FFF

The FFFs are built from the three-dimensional structural
arrangements of functionally important residues on
the basis of the biochemistry of the known function.
These geometric descriptors should be inherently more
exact than local sequence signatures, because they
encode structural as well as minimal sequence information
and, thus, they will be more descriptive of the
actual chemistry involved in the protein function. A general
outline of how to build an FFF is shown in Figure 3.

The first step is to perform a literature search to gather 
biochemical evidence about which residues are function- 
ally important. Next, a series of functionally related pro- 
teins with known structures are selected. These putative 
functionally important residues are superimposed in
space, and their relative geometries (distances, angles)
between α-carbon atoms and side-chain centers of mass
are recorded. Common secondary structures are identified,
if there is evidence in the literature for the importance
of such conformations. Structural superposition and
multiple sequence alignment can help identify other residues
that might be important, but these should be used
only if experimental evidence suggests a functional
significance. The procedure is iterative. After identification
of conserved residues, another literature search can be 
done to analyze the relative functional importance of 
these conserved residues and structures. We aim to use 
only those residues shown to be functionally important 
or conserved across a large set of proteins exhibiting the 
activity of interest. 

Once a set of geometric and conformational con- 
straints for a specific function has been identified, they 
are implemented in the form of a computer algorithm. 
The program searches experimentally determined protein
structures from the protein structural databank
(Abola et al., 1987) for sets of residues that satisfy the
specified constraints. The constraints are implemented
stepwise, so that structures that are eliminated by each 
criterion can be evaluated at each step along the way. If 
the constraint set misses any proteins known to exhibit 
the function under investigation, the structure of the 
missed protein is analyzed and the FFF modified. If the 
FFF selects proteins that are not known to display the 
function, then the structure of these "false positive" 
examples is compared to the known functional sites. 
Again, the FFF is modified to eliminate the false posi- 
tives, although some false positives could prove to be 
interesting if they identify a previously unrecognized 
activity in a protein. 
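
The stepwise screening described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the program used in this work: the residue record layout, the function names `find_sites` and `screen`, and the toy coordinates and distances are all hypothetical, and real FFFs would take their distances and allowed variances from Table 2.

```python
import math
from itertools import product

# Assumed layout: a structure is a list of (residue_type, (x, y, z)) tuples,
# where the coordinates are those of the alpha-carbon atom.

def find_sites(residues, types, constraints):
    """Return index tuples whose residue types match `types` and whose
    alpha-carbon distances satisfy every (i, j, target, tolerance)."""
    pools = [[k for k, (t, _) in enumerate(residues) if t == want]
             for want in types]
    hits = []
    for combo in product(*pools):
        if len(set(combo)) < len(combo):
            continue  # one residue cannot fill two positions of the site
        if all(abs(math.dist(residues[combo[i]][1], residues[combo[j]][1])
                   - target) <= tol
               for i, j, target, tol in constraints):
            hits.append(combo)
    return hits

def screen(structures, types, constraint_steps):
    """Apply groups of constraints one step at a time and record which
    structures each step eliminates, so every step can be inspected."""
    surviving = dict(structures)
    applied = []
    eliminated_by_step = []
    for step in constraint_steps:
        applied.extend(step)
        failed = [name for name, residues in surviving.items()
                  if not find_sites(residues, types, applied)]
        for name in failed:
            del surviving[name]
        eliminated_by_step.append(failed)
    return surviving, eliminated_by_step

# Toy example: three hypothetical models screened for an H-H-E triad.
models = {
    "model_a": [("H", (0, 0, 0)), ("H", (5, 0, 0)), ("E", (5, 6, 0))],
    "model_b": [("H", (0, 0, 0)), ("H", (5, 0, 0)), ("E", (5, 12, 0))],
    "model_c": [("H", (0, 0, 0)), ("H", (9, 0, 0)), ("E", (9, 6, 0))],
}
steps = [[(0, 1, 5.0, 1.0)], [(1, 2, 6.0, 1.0)]]
kept, removed = screen(models, ("H", "H", "E"), steps)
print(sorted(kept), removed)
```

Keeping the per-step elimination lists makes it possible to see which criterion rejects a known member of the family (a missed protein) or fails to reject a false positive, which is the information needed to modify the FFF.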

At this stage, a tentative FFF is generated that can be 
applied to structures of varying quality (Figure 3). While 
the FFF is initially tuned to high-resolution structures, it 
might be loosened to accommodate ambiguities inherent
in lower-resolution models. Ideally, such fuzziness
should not degrade the performance on high-resolution
structures. Thus, the extent of fuzziness is ascertained by
the performance on exact (i.e. a set of high-resolution)
structures and on low-resolution models of known
structure.

Using this method, FFFs were created for the disulfide
oxidoreductase activity of the glutaredoxin/thioredoxin
family and the RNA hydrolytic activity of the T1 ribonuclease
family. The information from the literature about
the enzymatic reaction mechanism used to create these
FFFs is described in Results. For the disulfide
oxidoreductase FFF, 1aaz chains A and B (Eklund et al., 1992),
1dsb chain A (Martin et al., 1993) and 4trx (Forman-Kay
et al., 1990) were used to define the active-site geometric
information. For the T1 ribonuclease FFF, 1rtu (Noguchi
et al., 1995), 1fus (Vassylyev et al., 1993) and 1rms
(Nonaka et al., 1993) were used to define the active site.
The final α-carbon to α-carbon distances and their
allowed variances that describe the final FFFs are com- 
piled in Table 2. Most often, these distances are similar 
to the average distances for proteins used to define the 
FFF and the allowed variance correlates with the stan- 
dard deviation (see Table 2 for examples). The FFFs 
themselves and their application to actual and predicted 
structures are described in more detail in Results. 
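
As a rough illustration of how a distance and its allowed variance might be summarized from the template structures, consider the sketch below. The text states only that the final distances are usually close to the template averages and that the allowed variance correlates with the standard deviation; the `scale` and `floor` parameters, the function names, and the sample distances are hypothetical and are not taken from Table 2.

```python
import statistics

def derive_constraint(distances, scale=2.0, floor=0.5):
    """Summarize one alpha-carbon pair distance measured in several
    template structures. Returns (target, tolerance) in angstroms.
    `scale` and `floor` are illustrative assumptions."""
    target = statistics.mean(distances)
    spread = statistics.stdev(distances) if len(distances) > 1 else 0.0
    return target, max(floor, scale * spread)

def satisfies(distance, target, tolerance, extra=0.0):
    """True if a model distance lies within the allowed variance;
    `extra` uniformly loosens the constraint, as in a control run."""
    return abs(distance - target) <= tolerance + extra

# Hypothetical distances for one residue pair in four templates.
target, tol = derive_constraint([5.4, 5.6, 5.5, 5.5])
print(round(target, 2), round(tol, 2))
print(satisfies(6.2, target, tol), satisfies(6.2, target, tol, extra=0.5))
```

The `extra` argument mirrors the kind of control in which every allowed variance is uniformly widened (for example by 0.5 Å) to check whether a looser descriptor begins to accept incorrect structures.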

Description of the threading or inverse 
folding algorithm 

In an inverse folding approach, one "threads" a probe 
sequence through different template structures and 
attempts to find the most compatible structure for a 
given sequence. The threading program used here is that 
created and distributed by Godzik and co-workers 
(Jaroszewski et al., 1998). Briefly, the sequence-to-structure
alignments are performed by a "local-global" version
of the Smith-Waterman algorithm (Waterman,
1995). The alignments are then ranked by three different
scoring methods (Jaroszewski et al., 1998). The first, SQ,
is based on a sequence-sequence type of scoring. In this
sequence-based method, the Gonnet mutation matrix
was used to optimize gap penalties, as described by
Vogt and Argos (Vogt et al., 1995). The second method,
BR, is a sequence-structure scoring method that is based 
on the pseudo-energy from the probe sequence 
"mounted" in the structural environment in the template 
structure. The pseudo-energy term reflects the statistical 
propensity of successive amino acid pairs (from the 
probe sequence) to be found in particular secondary 
structures within the template structure. The third meth- 
od, TT, is a structure-structure scoring method, whereby 
information from the known template structure is com- 
pared to the predicted secondary structure of the probe 
sequence. The secondary structure prediction scheme for 
the probe sequence was the nearest-neighbor algorithm 
(L. Rychlewski & A. Godzik, unpublished). The version 
used here achieves an average three-state prediction 
accuracy of 74%. 
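
For readers unfamiliar with the underlying dynamic programming, the sketch below computes a standard Smith-Waterman local alignment score. It is not the "local-global" variant used by the threading program, and it does not use the Gonnet matrix or the SQ, BR or TT scoring terms; the match, mismatch and linear gap values are arbitrary assumptions chosen only to keep the example short.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b using a
    simple match/mismatch scheme and a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                  # start a new local alignment
                         prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                         prev[j] + gap,      # gap in b
                         cur[j - 1] + gap)   # gap in a
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
```

In a threading setting the substitution term would be replaced by a score for placing a probe residue at a template position, but the recurrence has the same shape.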

Once we have computed scores for the sequence-to- 
structure alignments, the statistical significance of each 
score must be determined. To determine this signifi- 
cance, the distribution of scores is fit to an extreme value 
distribution and the raw score is compared to the chance 
of obtaining the same score when comparing two unre- 
lated sequences, as described by Godzik and co-workers 
(Jaroszewski et al., 1998). Tables 3 and 4 report the
significance score of the top sequence-to-structure 
alignments, rather than the raw score. 
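
The idea of judging a raw score against an extreme value distribution can be sketched as below. This is an assumption-laden illustration: it fits a Gumbel (type I extreme value) distribution by the method of moments and reports a tail probability, whereas the exact fitting procedure and the definition of the significance score reported in Tables 3 and 4 are those of Jaroszewski et al. (1998) and are not reproduced here. The function names and the background scores are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location (mu) and
    scale (beta) from scores of unrelated sequence-structure pairs."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta

def tail_probability(score, mu, beta):
    """Chance of an unrelated pair scoring at least `score`:
    1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    # Clamp the exponent so very low scores do not overflow exp().
    return -math.expm1(-math.exp(min(-z, 700.0)))

# Hypothetical background scores and one high-scoring alignment.
background = [10, 12, 11, 13, 9, 14, 10, 12]
mu, beta = fit_gumbel(background)
print(tail_probability(30, mu, beta) < tail_probability(12, mu, beta))
```

A smaller tail probability means the score is less likely to arise by chance, so a ranking based on it is comparable across templates with different raw score ranges.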

Once the alignment of the probe sequence-to-template 
structure has been determined, a three-dimensional 
model must be built. Scripts utilizing the automatic mod- 
eling tools provided by Modeller4 (Sali & Blundell, 1993) 
were developed (L. Jaroszewski, K. Pawlowski & A. 
Godzik, unpublished). These scripts automatically pro- 
duce all-atom coordinate files for the three-dimensional 
model built from the sequence-to-structure alignment 
provided by the threading algorithm. The FFF was 
applied directly to these structures without any further 
enhancement, energy calculations or molecular mech- 
anics simulations of the model. 

Description of MONSSTER, the ab initio
folding algorithm

Some predicted structures were produced using a
method for the ab initio prediction of protein structures
at low resolution (Ortiz et al., 1998; Skolnick et al., 1997).
Predicted structures used for the FFF analysis were
taken directly from the set of correctly and incorrectly
folded proteins produced by this procedure. Briefly, the
procedure can be divided into two parts: restraint derivation
using information extracted from multiple
sequence alignment and structure assembly/refinement
using an improved version of the MONSSTER algorithm,
which uses a high coordination lattice-based Cα representation
for the folding of proteins (Skolnick et al.,
1997), modified to incorporate the expected accuracy and
precision of the predicted tertiary restraints (Ortiz et al.,
1998).

For each protein sequence, 10 to 40 independent simu- 
lated annealing simulations from a fully extended initial 
conformation were carried out (assembly runs). Struc- 
tures were then clustered and all low-energy structures 
were subjected to low-temperature, isothermal refine- 
ment. The predicted fold is that of lowest average 
energy. The FFFs were tested on a series of correctly and 
incorrectly folded structures produced during the assembly
and isothermal runs for proteins 1ego (Xia et al.,
1992), 1poh (Jia et al., 1993), 1ubq (Vijay-Kumar et al.,
1987) and 1cis (Osmark et al., 1993).